SuperMinHash - A New Minwise Hashing Algorithm for Jaccard Similarity Estimation

Author

  • Otmar Ertl
Abstract

This paper presents a new algorithm for calculating hash signatures of sets which can be directly used for Jaccard similarity estimation. The new approach is an improvement over the MinHash algorithm, because it has a better runtime behavior and the resulting signatures allow a more precise estimation of the Jaccard index.
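
For orientation, the baseline that the abstract compares against can be sketched as follows. This is classic MinHash, not the SuperMinHash algorithm of the paper, whose details are not given here. The function names (`minhash_signature`, `estimate_jaccard`) and the signature size `m` are hypothetical choices made for illustration, and seeded BLAKE2b is assumed as a stand-in for a family of independent random hash functions.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item` under a given seed (illustrative choice)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, m: int = 256) -> list:
    """Classic MinHash: for each of m seeded hash functions, keep the minimum hash value."""
    items = list(items)
    if not items:
        raise ValueError("cannot compute a signature of an empty set")
    # Cost is proportional to m * len(items): every element is hashed m times.
    return [min(_hash(seed, x) for x in items) for seed in range(m)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Estimate the Jaccard index as the fraction of matching signature components."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = {str(i) for i in range(0, 150)}
    b = {str(i) for i in range(50, 200)}
    # The true Jaccard index is 100 / 200 = 0.5; the estimate should be close to it.
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

In this baseline every element is hashed once per signature component, so the work grows with the product of set size and signature size. That cost, together with the variance of the estimate, is what the abstract says the new approach improves on.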

Similar resources

b-Bit Minwise Hashing for Estimating Three-Way Similarities

Computing two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits (where b ≥ 2 for 3-way). The extension to 3-way similarity from the prior work on 2-way...

BagMinHash - Minwise Hashing Algorithm for Weighted Sets

Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time-consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than current state of the art without any particular rest...

Approximately Minwise Independence with Twisted Tabulation

A random hash function h is ε-minwise if for any set S, |S| = n, and element x ∈ S, Pr[h(x) = min h(S)] = (1 ± ε)/n. Minwise hash functions with low bias ε have widespread applications within similarity estimation. Hashing from a universe [u], the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA'13] makes c = O(1) lookups in tables of size u^(1/c). Twisted tabulation was invented to get good ...

b-Bit Minwise Hashing for Large-Scale Linear SVM

Linear Support Vector Machines (e.g., SVM, Pegasos, LIBLINEAR) are powerful and extremely efficient classification tools when the datasets are very large and/or high-dimensional, which is common in, e.g., text classification. Minwise hashing is a popular technique in the context of search for computing resemblance similarity between ultra high-dimensional (e.g., 2) data vectors such as document...

Optimal Densification for Fast and Accurate Minwise Hashing

Minwise hashing is a fundamental hashing algorithm and one of the most successful in the literature. Recent advances based on the idea of densification (Shrivastava & Li, 2014a;c) have shown that it is possible to compute k minwise hashes of a vector with d nonzeros in mere (d + k) computations, a significant improvement over the classical O(dk). These advances have led to an algorithmic impr...

Journal:
  • CoRR

Volume: abs/1706.05698

Publication date: 2017